REMARKS

Pending Claims 

Claims 21-34 are pending in this application. Claims 21, 25, 27, 28 and 30-34 have been 
withdrawn by the Examiner. Claims 22-24, 26 and 29 are currently being examined in this application. 

Restriction Requirement 

Applicants reiterate the election, with traverse, of the claims of Group 70 (claims 22-24, 26 and 
29) for prosecution in this application. Applicants reserve the right to prosecute non-elected subject 
matter in subsequent divisional applications. 

Rejoinder 

Applicants reiterate that the method claims 30-32, which depend from the product claims of
Group 70, should be rejoined and examined. The Examiner's attention is directed to the
Commissioner's Notice in the Official Gazette of March 26, 1996, entitled "Guidance on Treatment of
Product and Process Claims in Light of In re Ochiai, In re Brouwer and 35 U.S.C. § 103(b)," which
sets forth the rules, upon allowance of product claims, for rejoinder of process claims covering the
same scope of products. Therefore, upon allowance of any of the claims within Group 70, i.e., claims
22-24, 26 and 29, the method claims 30-32, which depend therefrom, should be rejoined and
examined.

The Enablement rejection under 35 U.S.C. § 112, first paragraph 

Claims 22-24, 26 and 29 have been rejected under 35 U.S.C. § 112, first paragraph for 
alleged lack of enablement of the variant polypeptides recited in the claims. In particular, the Office 
Action alleges that the specification "is enabling only for claims limited to polynucleotides encoding 
polypeptides represented by SEQ ID NO:5 and polynucleotides represented by SEQ ID NO:70 
because the specification does not reasonably provide enablement for polynucleotides encoding 
polypeptide variants having at least 90% sequence identity to SEQ ID NO:5 or polynucleotides with at 
least 90% sequence identity to SEQ ID NO:70." (Office Action at page 6.) The basis for this 
rejection appears to be that "[s]aid polypeptides have no claimed biochemical, immunological or 
physiological function" and that since "[p]rotein chemistry is probably one of the most unpredictable
areas of biotechnology... the effects of sequence dissimilarities upon protein structure and function
cannot be predicted." (Office Action at page 6.) Applicants respectfully disagree with the Examiner 
and traverse the rejection. 

The first paragraph of 35 U.S.C. §112 requires that the Specification describe how to make 
and use the claimed subject matter. That requirement has been met in the present application. In 
particular, the Specification describes how to make and use naturally-occurring polypeptide variants of 
SEQ ID NO:5 and polynucleotides encoding such variants. 

Independent claim 22 recites not only that the "variant" polynucleotides encode polypeptides 
that are at least 90% identical to SEQ ID NO:5, but also that those polypeptides have "a naturally-occurring amino acid
sequence." Through the process of natural selection, nature will have determined the appropriate
amino acid sequences. Given the information provided by SEQ ID NO:5 (the amino acid sequence of 
HTRM) and SEQ ID NO:70 (the polynucleotide sequence encoding HTRM), one of skill in the art 
would be able to routinely obtain a polynucleotide encoding a polypeptide comprising "a naturally- 
occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID NO:5." 
Likewise for the "variant" polynucleotides defined by independent claim 29: "a polynucleotide 
comprising a naturally occurring polynucleotide sequence at least 90% identical to the polynucleotide 
sequence of SEQ ID NO:70." For example, the identification of relevant polynucleotides could be 
performed by hybridization and/or PCR techniques that were well-known to those skilled in the art at 
the time the subject application was filed and/or described throughout the Specification of the instant 
application. For example: 

The term "stringent conditions" refers to conditions which permit hybridization 
between polynucleotides and the claimed polynucleotides. Stringent conditions can be 
defined by salt concentration, the concentration of organic solvent, e.g., formamide, 
temperature, and other conditions well known in the art. In particular, stringency can be 
increased by reducing the concentration of salt, increasing the concentration of 
formamide, or raising the hybridization temperature. (Specification at page 11, lines 7- 
11) 

In one aspect, hybridization with PCR probes which are capable of detecting 
polynucleotide sequences, including genomic sequences, encoding HTRM or closely 
related molecules may be used to identify nucleic acid sequences which encode HTRM. 
The specificity of the probe, whether it is made from a highly specific region, e.g., the 5' 



119301    7    09/674,743    Docket No.: PF-0509 USN

regulatory region, or from a less specific region, e.g., a conserved motif, and the 
stringency of the hybridization or amplification (maximal, high, intermediate, or low), will 
determine whether the probe identifies only naturally occurring sequences encoding 
HTRM, allelic variants, or related sequences. (Specification at page 31, lines 17-20)

Probes may also be used for the detection of related sequences, and should 
preferably have at least 50% sequence identity to any of the HTRM encoding sequences. 
The hybridization probes of the subject invention may be DNA or RNA and may be 
derived from the sequence of SEQ ID NO:66-130 or from genomic sequences including 
promoters, enhancers, and introns of the HTRM gene. (Specification at page 38, lines 
10-16) 

Thus, one skilled in the art need not make and test vast numbers of polypeptides that are based 
on the amino acid sequence of SEQ ID NO:5. Instead, one skilled in the art need only screen a cDNA 
library or use appropriate PCR conditions to identify relevant polynucleotides/polypeptides that already 
exist in nature. By adjusting the nature of the probe or nucleic acid (i.e., non-conserved, conserved or 
highly conserved) and the conditions of hybridization (maximum, high, intermediate or low stringency), 
one can obtain variant polynucleotides of SEQ ID NO:70 which, in turn, will allow one to make the 
variant polypeptides of SEQ ID NO:5 recited by the present claims.

Accordingly, the document cited by the Examiner relating to structure-function relationships in 
proteins (Bowie et al.) is simply not germane to whether one can make and use the polypeptide variants 
recited by the present claims. Likewise, the cited document relating to alleged difficulties in assigning 
protein function based on homology comparison is not relevant to making the claimed polynucleotide 
variants. That is, regardless of the precise functional characteristics of the SEQ ID NO:5 and SEQ ID 
NO:70 variants, one can still make the claimed polynucleotide variants using the disclosure provided by 
the present Specification. The polynucleotides could then be used in, for example, diagnostic testing, 
drug discovery, expression profiling, etc. 

Furthermore, the Examiner's attention is also directed to the enclosed reference by Brenner et 
al. ("Assessing sequence comparison methods with reliable structurally identified distant evolutionary 
relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through exhaustive analysis of a 
data set of proteins with known structural and functional relationships and with <90% overall sequence 
identity, Brenner et al. have determined that 30% identity is a reliable threshold for establishing 
evolutionary homology between two sequences aligned over at least 150 residues. (Brenner et al., 
pages 6073 and 6076.) Furthermore, local identity is particularly important in this case for assessing
the significance of the alignments, as Brenner et al. further report that ≥40% identity over at least 70
residues is reliable in signifying homology between proteins. (Brenner et al., page 6076.)
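
By way of illustration only, the identity criteria attributed to Brenner et al. above can be expressed as a short calculation. The following Python sketch assumes two already-aligned sequences of equal length in which `-` marks a gap; the function names and the example sequences are hypothetical and are not taken from Brenner et al. or from the Specification.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, number of aligned residue pairs).

    Assumes the inputs are the two rows of a pairwise alignment, equal in
    length, with '-' marking gaps. Columns containing a gap are not counted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs), len(pairs)


def meets_threshold(identity, aligned_len, min_identity, min_len):
    """True if an alignment satisfies an identity-over-length criterion,
    e.g. (30.0, 150) or (40.0, 70) for the two criteria discussed above."""
    return identity >= min_identity and aligned_len >= min_len


# Hypothetical toy alignment: 4 identical residues out of 5 ungapped columns.
identity, length = percent_identity("ACDE-F", "ACDQGF")
print(identity, length)
```

How gapped columns are treated is a design choice in this sketch; other conventions for computing percent identity exist and would give different values for the same alignment.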

Claim 22 recites, inter alia, a polynucleotide encoding a polypeptide comprising "a naturally 
occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID NO:5." 
In accordance with Brenner et al., naturally occurring molecules may exist which could be characterized
as HTRM proteins and which have as little as 30% identity over at least 150 residues to SEQ ID 
NO:5. The "90% variants" recited by the present claims have a variation that is far less than that of all 
potential HTRM proteins related to SEQ ID NO:5, i.e., those HTRM proteins having as little as 30% 
identity over at least 150 residues to SEQ ID NO:5. Therefore, one would expect the SEQ ID NO:5 
variants recited by the present claims to have the functional activities of a HTRM protein. 

While the Examiner has cited literature identifying some of the difficulties that may be involved
in predicting protein function, none suggests that functional homology cannot be inferred by a reasonable
probability in this case. Bork, Genome Research 10:398-400 (2000); Bowie et al., Science 257:1306-
1310 (1990); Burgess et al., Journal of Cell Biology 111:2129-2138 (1990); Lazar et al., Molecular
and Cellular Biology 8:1247-1252 (1998). Importantly, none contradicts Brenner's basic rule that
sequence homology in excess of 40% over 70 or more amino acid residues yields a high probability of
functional homology as well. Brenner et al., Proceedings of the National Academy of Sciences USA
95:6073-6078 (1998). More importantly, they do not contradict the fact that the identification of the
polypeptide encoded by the claimed polynucleotides using a combination of methods provides 
compelling scientific evidence that the polypeptide has the functions of a human transcriptional regulator 
molecule. At most, these articles individually and together stand for the proposition that it is difficult to 
make predictions about function with certainty. The standard applicable in this case is not, however, 
proof to certainty, but rather proof to reasonable probability. 

As set forth in In re Marzocchi, 169 USPQ 367, 369 (CCPA 1971):

The first paragraph of § 112 requires nothing more than objective enablement. 
[emphasis added] How such a teaching is set forth, either by the use of illustrative 
examples or by broad terminology, is of no importance. 
As a matter of Patent Office practice, then, a specification disclosure which contains a
teaching of the manner and process of making and using the invention in terms which
correspond in scope to those used in describing and defining the subject matter sought
to be patented must be taken as in compliance with the enabling requirement of the first
paragraph of § 112 unless there is reason to doubt the objective truth of the statements 
contained therein which must be relied on for enabling support. 

Contrary to the standard set forth in Marzocchi, the Examiner has failed to provide any 
reasons why one would doubt that the guidance provided by the present Specification would enable 
one to make and use the recited variants of SEQ ID NO:5 or SEQ ID NO:70. Hence, a prima facie 
case for non-enablement has not been established with respect to the variants of SEQ ID NO:5 or 
SEQ ID NO:70. 

For at least the above reasons, withdrawal of this rejection is requested. 

The Written Description rejection under 35 U.S.C. § 112, first paragraph 

Claims 22-24, 26 and 29 have been rejected under 35 U.S.C. § 112, first paragraph for 
alleged lack of written description of the variant polypeptides recited in the claims. In particular, the 
Office Action alleges that, while the claims encompass sequences that have at least 90% identity to 
SEQ ID NO:5 (claims 22-24 and 26) or SEQ ID NO:70 (claim 29), corresponding sequences from
other species, mutated sequences, allelic variants, splice variants, etc., the specification provides
insufficient written description to support the genus encompassed by the claim. (Office Action at page 
9.) The Examiner asserts at page 11, paragraph 2, that "absent factual evidence, a percentage 
similarity of less than 100% is not deemed to reasonably support to one skilled in the art whether the 
biochemical activity of the claimed subject matter would be the same as that of such a similar known 
biomolecule." The Examiner goes on to reiterate the argument that even a single nucleotide or amino 
acid change can destroy the function of the biomolecule. Applicants respectfully disagree with the 
Examiner and traverse the rejection. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112, first
paragraph, are well established by case law.

... the applicant must also convey with reasonable clarity to those skilled in 
the art that, as of the filing date sought, he or she was in possession of the invention. 
The invention is, for purposes of the "written description" inquiry, whatever is now
claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991) 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1," published January 5, 2001, which
provide that:

An applicant may also show that an invention is complete by disclosure of sufficiently 
detailed, relevant identifying characteristics which provide evidence that applicant was in 
possession of the claimed invention, i.e., complete or partial structure, other physical 
and/or chemical properties, functional characteristics when coupled with a known or 
disclosed correlation between function and structure, or some combination of such 
characteristics. What is conventional or well known to one of ordinary skill in the art 
need not be disclosed in detail. If a skilled artisan would have understood the inventor 
to be in possession of the claimed invention at the time of filing, even if every nuance of 
the claims is not explicitly described in the specification, then the adequate description 
requirement is met. (footnotes omitted.) 

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 

SEQ ID NO:5 and SEQ ID NO:70 are specifically disclosed in the application. Variants of 
HTRM are described, for example, at page 5, line 33 to page 6, line 19. In particular, the preferred, 
more preferred, and most preferred variants (80%, 90%, and 95% amino acid sequence similarity to 
SEQ ID NO: 5) are described, for example, at page 13, lines 23-26. Incyte clones in which the nucleic 
acids encoding the human HTRM were first identified and libraries from which those clones were 
isolated are described, for example, in Table 3 of the Specification. Chemical and structural features of
HTRM are described, for example, at Examples X and XI (page 42). Given SEQ ID NO:5, one of
ordinary skill in the art would recognize naturally-occurring variants of SEQ ID NO:5 having 90% 
sequence identity to SEQ ID NO:5. Accordingly, the Specification provides an adequate written 
description of the recited polypeptide sequences. 

The Office Action has further asserted that the claims are not supported by an adequate written 
description because "[t]he species specifically disclosed are not representative of the genus because the 
genus is highly variant" (page 11 of the Office Action of November 18, 2003).

Such a position is believed to present a misapplication of the law. 
1. The present claims specifically define the claimed genus through the recitation
of chemical structure

Court cases in which "DNA claims" have been at issue commonly emphasize that the recitation
of structural features or chemical or physical properties is an important factor to consider in a written
description analysis of such claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed.
Cir. 1993), the court stated that:

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without
any reference to structural features. As set forth by the court in University of California v. Eli Lilly
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997):

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others, 
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. For 
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description 
requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel
priority application lacked an adequate written description of the subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics 
and were found not to comply with the written description requirement of 35 U.S.C. §112; i.e., "an 
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human 
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the 
claims at issue in the present application define polynucleotides in terms of chemical structure, rather 
than functional characteristics. For example, the "variant language" of independent claim 22 recites
chemical structure to define the claimed genus: 

22. An isolated and purified polynucleotide sequence encoding a polypeptide selected 
from the group consisting of:...b) a naturally-occurring amino acid sequence having at 
least 90% sequence identity to the sequence of SEQ ID NO:5... 

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present 
claims is defined in terms of the chemical structure of SEQ ID NO: 5. In the present case, there is no 
reliance merely on a description of functional characteristics of the polynucleotides recited by the 
claims. In fact, there is no recitation of functional characteristics. Moreover, if such functional
recitations were included, they would add to the structural characterization of the recited polynucleotides.
The claims of the present application define the polynucleotides by reciting structural features, and cases
such as Lilly and Fiers stress that the recitation of structure is an important factor to consider in a
written description analysis of claims of this type. By failing to base its written description inquiry "on
whatever is now claimed," the Office Action failed to provide an appropriate analysis of the present
claims and how they differ from those found not to satisfy the written description requirement in Lilly
and Fiers.

2. The present claims do not define a genus which is "highly variant" 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference by 
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through exhaustive
analysis of a data set of proteins with known structural and functional relationships and with <90%
overall sequence identity, Brenner et al. have determined that 30% identity is a reliable threshold for
establishing evolutionary homology between two sequences aligned over at least 150 residues.
(Brenner et al., pages 6073 and 6076.) Furthermore, local identity is particularly important in this case
for assessing the significance of the alignments, as Brenner et al. further report that ≥40% identity over
at least 70 residues is reliable in signifying homology between proteins. (Brenner et al., page 6076.)

The present application is directed, inter alia, to HTRM proteins related to the amino acid 
sequence of SEQ ID NO:5. In accordance with Brenner et al., naturally occurring molecules may exist
which could be characterized as HTRM proteins and which have as little as 40% identity over at least 
70 residues to SEQ ID NO:5. The "variant language" of the present claims recites, for example, 
polynucleotides encoding "a naturally-occurring amino acid sequence having at least 90% sequence 
identity to the sequence of SEQ ID NO:5" (note that SEQ ID NO:5 has 301 amino acid residues). 
This variation is far less than that of all potential HTRM proteins related to SEQ ID NO:5, i.e., those 
HTRM proteins having as little as 40% identity over at least 70 residues to SEQ ID NO:5.

3. The state of the art at the time of the present invention is further advanced than 
at the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. §112. The '525 patent claimed the benefit of 
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases was
based on the state of the art essentially at the "dark ages" of recombinant DNA technology.

The present application has a priority date of 05/04/1999. Much has happened in the
development of recombinant DNA technology in the years between the filing of the applications
involved in Lilly and Fiers and the filing of the present application. For example, the technique of polymerase
chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing technology has been 
developed. Large databases of protein and nucleotide sequences have been compiled. Much of the 
raw material of the human and other genomes has been sequenced. With these remarkable advances 
one of skill in the art would recognize that, given the sequence information of SEQ ID NO:5 and SEQ 
ID NO:70, and the additional extensive detail provided by the subject application, the present inventors
were in possession of the claimed polynucleotide variants at the time of filing of this application.

4. Summary

The Office Action failed to base its written description inquiry "on whatever is now claimed." 
Consequently, the Action did not provide an appropriate analysis of the present claims and how they 
differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found 
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical 
structure of SEQ ID NO:5 or SEQ ID NO:70. The courts have stressed that structural features are 
important factors to consider in a written description analysis of claims to nucleic acids and proteins. In 
addition, the genus of polynucleotides defined by the present claims is adequately described, as 
evidenced by Brenner et al. and consideration of the claims of the '740 patent involved in Lilly.
Furthermore, there have been remarkable advances in the state of the art since the Lilly and Fiers 
cases, and these advances were given no consideration whatsoever in the position set forth by the 
Office Action. 

The Indefiniteness rejection under 35 U.S.C. § 112, second paragraph 

Claims 22-24 and 26 have been rejected under 35 U.S.C. § 112, second paragraph for 
alleged indefiniteness for being dependent on a non-elected claim (claim 21). Claim 22 (from which 
claims 23 and 24 depend) and claim 26 are amended herewith to fully recite the subject matter which is 
being claimed in an unambiguous manner. The basis for rejection of these claims is thereby obviated. 
Support for these amendments may be found in non-elected claim 21 and in originally filed claims 1-6 
and 9-11. No new matter is added by these amendments. Accordingly, it is respectfully requested that 
this rejection be withdrawn. 
CONCLUSION

In light of the above amendments and remarks, Applicants submit that the present application is 
fully in condition for allowance, and request that the Examiner withdraw the outstanding rejections. 
Early notice to that effect is earnestly solicited. 

If the Examiner contemplates other action, or if a telephone conference would expedite 
allowance of the claims, Applicants invite the Examiner to contact the undersigned at the number 
listed below. 

Applicants believe that no fee is due with this communication. However, if the USPTO 
determines that a fee is due, the Commissioner is hereby authorized to charge Deposit Account 
No. 09-0108. 

Respectfully submitted, 
INCYTE CORPORATION 
superfamilies. Pearson found that modern matrices and "ln-scaling"
of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependence of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence comparison.
Rather than comparing different sequence comparison
algorithms, their work focused on determining a length-dependent
threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward evaluating
alignment accuracy rather than homology detection.)

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.
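
To make the statistical scores mentioned above concrete, the following is a minimal sketch of the commonly cited Karlin-Altschul relation, E = K·m·n·exp(−λS), together with the corresponding P-value, P = 1 − exp(−E). It is an illustration rather than a description of how any particular program computes its statistics. The values passed for `lam` and `k` in the example call are placeholders; real values depend on the scoring matrix and gap penalties and are not given in this text.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`.

    Implements E = k * m * n * exp(-lam * score), where m and n are the
    query and database lengths and lam, k are scoring-system parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)


# Placeholder parameters, for illustration only.
e = expect_value(score=60, query_len=300, db_len=1_000_000, lam=0.3, k=0.1)
print(e, p_value(e))
```

For small E the two measures are nearly equal, which is why they are often used interchangeably to rank matches.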

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary relationships
than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the immunoglobulins,
would be recognized as related by the vast majority of the
biological community despite the lack of high sequence similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those
<40% identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
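
The selection procedure described above can be sketched as a greedy filter. This is only an illustration under stated assumptions, not the authors' code: the function names, the quality and identity callables, and the toy position-by-position identity measure are all hypothetical and introduced here for the example.

```python
from typing import Callable, List, Sequence


def select_nonredundant(
    domains: Sequence[str],
    quality: Callable[[str], float],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily keep the best remaining domain and drop its close relatives."""
    # Sort once so the highest quality domain is always first in the list.
    remaining = sorted(domains, key=quality, reverse=True)
    selected: List[str] = []
    while remaining:
        best = remaining[0]
        selected.append(best)
        # Remove the selected domain and every domain whose identity to it
        # is at or above the threshold; only more distant domains remain.
        remaining = [d for d in remaining[1:] if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


def toy_identity(a: str, b: str) -> float:
    """Hypothetical identity measure: percent of matching positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


if __name__ == "__main__":
    # Toy data: made-up sequences and quality values, for illustration only.
    toy_quality = {"AAAA": 3.0, "AAAT": 2.0, "TTTT": 1.0}
    kept = select_nonredundant(list(toy_quality), toy_quality.get, toy_identity, 40.0)
    print(kept)
    print(ordered_pairs(1323))
```

With the 1,323 domains reported for PDB40D-B, ordered_pairs gives 1,749,006, which matches the total quoted in the text.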

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence comparison
algorithms (using the optimal scoring scheme) to determine
their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
a perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect separation
is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonhomologs.
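
The two quantities defined above can be computed directly from a ranked list of hits. The following is a minimal sketch, assuming each hit is a (score, is_homolog) pair in which a higher score is better (an E-value would have to be negated or otherwise inverted first); the function names and the toy data are hypothetical, and tied scores are not given any special handling.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[float, bool]  # (score, is_true_homolog); higher score is better


def coverage_and_epq(
    hits: Iterable[Hit], n_homolog_pairs: int, n_queries: int, threshold: float
) -> Tuple[float, float]:
    """Coverage and errors per query (EPQ) for hits scoring above a threshold."""
    above = [is_homolog for score, is_homolog in hits if score > threshold]
    true_pairs = sum(above)            # homologous pairs above the threshold
    errors = len(above) - true_pairs   # nonhomologous pairs above the threshold
    return true_pairs / n_homolog_pairs, errors / n_queries


def coverage_vs_error_curve(
    hits: Iterable[Hit], n_homolog_pairs: int, n_queries: int
) -> List[Tuple[float, float]]:
    """(EPQ, coverage) after each hit, walking from best score to worst."""
    points: List[Tuple[float, float]] = []
    true_pairs = 0
    errors = 0
    for _, is_homolog in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_homolog:
            true_pairs += 1
        else:
            errors += 1
        points.append((errors / n_queries, true_pairs / n_homolog_pairs))
    return points


if __name__ == "__main__":
    # Toy data: five hits from two queries, with four true homologous pairs
    # in the database (one of them is never reported as a hit).
    toy_hits = [(50.0, True), (40.0, True), (30.0, False), (20.0, True), (10.0, False)]
    print(coverage_and_epq(toy_hits, n_homolog_pairs=4, n_queries=2, threshold=25.0))
    print(coverage_vs_error_curve(toy_hits, n_homolog_pairs=4, n_queries=2))
```

Plotting coverage against EPQ from the second function gives one curve per protocol, so protocols can be compared at a chosen error rate such as 1% EPQ.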

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score consistency;
that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are accurate.

[...] length and percentage identity are quantized, many pairs
may have exactly the same alignment length and percentage identity.

[...] the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the [...].
The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given
statistical score.
The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
[...] of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the alignment
and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
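
To make the first two score types concrete, the sketch below computes a percentage identity and a raw score from a single gapped alignment. It rests on several assumptions that are not taken from the paper: identity is measured over aligned (non-gap) columns only, a gap of length n costs gap_open + (n - 1) * gap_extend, and toy_substitution is a made-up match/mismatch scheme standing in for a real matrix such as BLOSUM.

```python
from typing import Callable


def toy_substitution(a: str, b: str) -> int:
    """Hypothetical scores: +5 for identical residues, -2 otherwise."""
    return 5 if a == b else -2


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues among the columns that contain no gap."""
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)


def raw_score(
    aligned_a: str,
    aligned_b: str,
    substitution: Callable[[str, str], int],
    gap_open: int = 12,
    gap_extend: int = 1,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    previous_gap = None  # which sequence, if any, had a gap in the last column
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            current_gap = "a" if a == "-" else "b"
            # The first gap position pays the opening cost; later ones extend.
            score -= gap_extend if previous_gap == current_gap else gap_open
            previous_gap = current_gap
        else:
            score += substitution(a, b)
            previous_gap = None
    return score


if __name__ == "__main__":
    print(percent_identity("ACDE--FG", "ACDQKLFG"))
    print(raw_score("ACDE--FG", "ACDQKLFG", toy_substitution))
```

In this toy alignment the identity is high while the raw score is heavily reduced by the two-residue gap, which illustrates why the two measures can rank the same match very differently.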

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule of thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The principal
reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the conservative
or radical nature of residue substitutions.

From the analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has at least 40% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Using the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most [...]

[...] confidence by more than an order of magnitude at 1% EPQ [...]

[...] relationships have no more sequence identity than would be expected [...]

[...] significant E-values, but 26 of these involve sequences with <50 [...]
CONCLUSION

Our results also suggest two further points. First, the E-values
[...] provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2 and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison
can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.